Authors: Mauro Venticinque | Angelo Schillaci | Daniele Tambone

GitHub project: Bank-Marketing

Date: 2025-05-20

1 Introduction

In this project, we analyze data from a Portuguese banking institution’s direct marketing campaigns to identify key factors influencing customer subscription to term deposits.

A deposit account is a bank account, maintained by a financial institution, into which a customer can deposit money and from which they can withdraw it. Deposit accounts include savings accounts, current accounts, and term deposits, among other types.

The dataset includes client demographics, previous campaign interactions, and economic indicators. Our goal is to develop insights that will enhance the effectiveness of future marketing strategies. By applying supervised learning techniques, we aim to predict customer responses and optimize outreach efforts for better engagement and conversion rates.

The report will begin with an Exploratory Data Analysis, examining the variables and their relationship with the target attribute (subscribed) to identify the most influential factors.

2 Exploratory Data Analysis

2.1 Variable descriptions

Bank client data:

  1. age (Integer): age of the customer
  2. job (Categorical): occupation
  3. marital (Categorical): marital status
  4. education (Categorical): education level
  5. default (Binary): has credit in default?
  6. housing (Binary): has housing loan?
  7. loan (Binary): has personal loan?
  8. contact (Categorical): contact communication type
  9. month (Categorical): last contact month of year
  10. day_of_week (Categorical): last contact day of the week
  11. duration (Integer): last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y=‘no’). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model

Other attributes:

  1. campaign (Integer): number of contacts performed during this campaign and for this client (numeric, includes last contact)
  2. pdays (Integer): number of days that passed after the client was last contacted during a previous campaign (numeric; 999 means the client was not previously contacted)
  3. previous (Integer): number of contacts performed before this campaign and for this client
  4. poutcome (Categorical): outcome of the previous marketing campaign (categorical: ‘failure’,‘nonexistent’,‘success’)

Social and economic context attributes:

  1. emp.var.rate (Numeric): employment variation rate - quarterly indicator
  2. cons.price.idx (Numeric): consumer price index - monthly indicator
  3. cons.conf.idx (Numeric): consumer confidence index - monthly indicator
  4. euribor3m (Numeric): euribor 3 month rate - daily indicator
  5. nr.employed (Numeric): number of employees - quarterly indicator

Output variable (desired target):

  1. subscribed (Binary): has the client subscribed to a term deposit?

Source: UCI Machine Learning Repository

Note: our dataset does not include the bank balance variable.
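As a first step, the training data can be loaded and its shape and target checked against the summary below; a minimal sketch, in which the file name "train.csv" is an assumption:

```r
# Load the training data (file name assumed) and inspect its dimensions
# and the raw counts of the target variable.
train <- read.csv("train.csv", stringsAsFactors = FALSE)

dim(train)               # should match the data summary: 32950 rows, 21 columns
table(train$subscribed)  # raw counts of "no" vs "yes"
```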

More details

Data summary
Name train
Number of rows 32950
Number of columns 21
_______________________
Column type frequency:
character 11
numeric 10
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
job 0 1 6 13 0 12 0
marital 0 1 6 8 0 4 0
education 0 1 7 19 0 8 0
default 0 1 2 7 0 3 0
housing 0 1 2 7 0 3 0
loan 0 1 2 7 0 3 0
contact 0 1 8 9 0 2 0
month 0 1 3 3 0 10 0
day_of_week 0 1 3 3 0 5 0
poutcome 0 1 7 11 0 3 0
subscribed 0 1 2 3 0 2 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
age 0 1 40.04 10.45 17.00 32.00 38.00 47.00 98.00 ▅▇▃▁▁
duration 0 1 258.66 260.83 0.00 102.00 180.00 318.00 4918.00 ▇▁▁▁▁
campaign 0 1 2.57 2.77 1.00 1.00 2.00 3.00 43.00 ▇▁▁▁▁
pdays 0 1 961.90 188.33 0.00 999.00 999.00 999.00 999.00 ▁▁▁▁▇
previous 0 1 0.17 0.49 0.00 0.00 0.00 0.00 7.00 ▇▁▁▁▁
emp.var.rate 0 1 0.08 1.57 -3.40 -1.80 1.10 1.40 1.40 ▁▃▁▁▇
cons.price.idx 0 1 93.57 0.58 92.20 93.08 93.75 93.99 94.77 ▁▆▃▇▂
cons.conf.idx 0 1 -40.49 4.63 -50.80 -42.70 -41.80 -36.40 -26.90 ▅▇▁▇▁
euribor3m 0 1 3.62 1.74 0.63 1.34 4.86 4.96 5.04 ▅▁▁▁▇
nr.employed 0 1 5167.01 72.31 4963.60 5099.10 5191.00 5228.10 5228.10 ▁▁▃▁▇

The dataset includes 21 variables and 32,950 rows, with no missing values.
Categorical variables like job and education show good diversity, while default, loan, and housing each take only three values (yes, no, unknown).

Among numeric variables, age has a fairly normal distribution (mean ≈ 40, sd ≈ 10), while duration and pdays are highly skewed, with extreme values up to 4918 and 999 respectively.
Some variables (e.g., campaign, previous) have a low median but long tails, indicating that most observations are clustered at low values.
Macroeconomic variables such as emp.var.rate, euribor3m, and nr.employed are more stable, with tight interquartile ranges, suggesting consistent economic conditions during data collection.

2.2 Analysis of distributions

First, we see that the dataset is unbalanced: the majority of contacted clients did not subscribe.
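The imbalance can be quantified directly; a minimal sketch, assuming the training data frame is called train with target column subscribed:

```r
# Class proportions of the target variable
prop.table(table(train$subscribed))

# Visualize the imbalance
barplot(table(train$subscribed),
        main = "Distribution of subscribed",
        ylab = "Count")
```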

Correlation and Pairwise Relationships

Correlation Matrix
The correlation matrix reveals clear patterns among the numerical variables. Notably, euribor3m, nr.employed, and emp.var.rate are strongly positively correlated with each other, suggesting that these variables capture similar information about the economic environment. This should be taken into account in predictive modeling, as using them together could lead to multicollinearity. In contrast, variables like campaign, pdays, and previous show very weak correlations with most other features, indicating they may contribute more independently to the model.
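A correlation matrix of this kind can be computed on the numeric columns; a sketch, assuming the training data frame is called train:

```r
# Correlation matrix of the numeric predictors
num_vars <- c("age", "duration", "campaign", "pdays", "previous",
              "emp.var.rate", "cons.price.idx", "cons.conf.idx",
              "euribor3m", "nr.employed")
cor_mat <- round(cor(train[, num_vars]), 2)
cor_mat

# Flag strongly correlated pairs (|r| > 0.8) as multicollinearity candidates
high <- which(abs(cor_mat) > 0.8 & upper.tri(cor_mat), arr.ind = TRUE)
data.frame(var1 = rownames(cor_mat)[high[, 1]],
           var2 = colnames(cor_mat)[high[, 2]],
           r    = cor_mat[high])
```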

Scatterplot Matrix by Target
Several variables, such as duration and pdays, show highly skewed distributions, which could influence model performance and may benefit from transformations (e.g., log transformation or binning). While some variables exhibit linear trends (e.g., euribor3m vs nr.employed), many scatterplots show dispersed or nonlinear patterns. This suggests that simple linear models may not fully capture the complexity in the data.

In certain plots, the blue points (subscribed) are concentrated in specific areas, showing the key factors that influenced successful subscriptions.
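A scatterplot matrix colored by the target can be produced with pairs(); a sketch over a subset of the numeric variables, assuming the data frame train:

```r
# Scatterplot matrix of selected numeric variables,
# with subscribers shown in blue as in the plots described above
cols <- ifelse(train$subscribed == "yes", "blue", "grey70")
pairs(train[, c("age", "duration", "campaign", "euribor3m", "nr.employed")],
      col = cols, pch = 20, cex = 0.4)
```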

Distribution of Subscribed across Different Variables

Box plot of age
Among older clients, refusals are noticeably less common.

Box plot of emp.var.rate
Subscriptions are concentrated at negative values of the employment variation rate.

Box plot of euribor3m
Subscriptions are concentrated at low values of the three-month Euribor rate.

Client data

Distribution of Age
The age distribution is right-skewed, with a peak around 30–40 years old. The proportion of subscribers is higher among those over 60. This may be due to greater financial stability in older age groups.

Distribution of Job
The distribution of occupations is not uniform: administrative workers form the largest group, and the number of subscribers among them is among the highest of all occupations, possibly because administrative workers tend to have higher incomes. Students and retired people, however, show the highest subscription proportions, which is consistent with the earlier plots: older people and people with higher education levels are more likely to subscribe.

Distribution of Education
Regarding education level, the distribution is not uniform: people with a university degree form the largest group, and their subscription proportion is among the highest of all education levels. This is probably because university graduates tend to have higher incomes and are therefore more likely to subscribe.

Distribution of Marital status
Single clients show a slightly higher subscription proportion than married or divorced clients.

Distribution of Contact
Clients contacted on a cellular phone subscribe more often than those contacted on a landline telephone.

Previous Campaign Data

Distribution of Contacts
Regarding the previous campaign: while most clients were not previously contacted, the success rate is visibly higher among those who were contacted more than once before or had a successful prior outcome. This suggests that prior engagement is positively associated with subscription, although such clients make up only a small share of the sample.

Temporal data

Distribution of Days of Week
The distribution of the last contact day of the week is roughly uniform, with a slight peak on Thursday. The proportion of subscribers is highest when the last contact occurs in the middle of the week.

Distribution of Months
In contrast, the distribution of the last contact month is far from uniform: most contacts occur in May. The subscription proportion is highest when the last contact occurs in March, December, September, or October. This may be because people are more willing to subscribe when they have more disposable income, and less so during the summer.

Distribution of Duration
The duration of the last contact is right-skewed, with a peak around 0–100 seconds. The subscription proportion is higher among clients contacted for a longer duration, probably because clients who stay on the call longer are more interested in subscribing.

Social and economic data

Distribution of Employment Variation
The distribution of the employment variation rate is not uniform: most observations have a positive or zero rate. The subscription proportion is highest when the rate is negative, probably because people have a greater propensity to save during a recession.

Distribution of Consumer Price Index
The subscription proportion is higher when the CPI is below 93, perhaps because lower prices leave people with more disposable income to commit to a deposit.

Distribution of Consumer Confidence Index
The subscription proportion is higher when the consumer confidence index is above -40, perhaps because more confident consumers are more willing to commit money to a deposit.

Distribution of Euribor 3 month rate
When considering the Euribor rate, one might expect a lower Euribor to reduce the deposit subscription rate, since most European banks align their deposit interest rate offers with ECB indexes, particularly the three-month Euribor. Yet the plot shows the opposite: a lower Euribor corresponds to a higher probability of deposit subscription, and this probability decreases as the three-month Euribor rises.

2.3 Conclusion

The Exploratory Data Analysis reveals several important insights into the factors that influence the likelihood of subscription in this dataset. Below there is a summary of the key findings:

  • The dataset is unbalanced, with the majority of contacted individuals not subscribing.
  • Both younger and older individuals exhibit a higher likelihood of subscribing compared to those in middle age.
  • Socio-demographic factors, such as education and job, appear to influence subscription rates; for example, individuals in administrative roles and those with higher education levels tend to subscribe more often.
  • Prior interaction with the campaign, especially repeated contacts or past successful outcomes, is positively associated with subscription.
  • Subscription rates vary by month, with peaks in March, December, September, and October. Additionally, longer call durations are linked to a higher likelihood of subscription.
  • All economic variables examined show significant associations with subscription. Specifically, lower CPI, a negative employment variation rate, and higher CCI are correlated with increased subscription rates.

In summary, the analysis suggests that financial conditions, previous campaign interactions, and macroeconomic indicators are strong predictors of subscription behavior. Demographic factors such as age, occupation, and education level also contribute meaningfully to the outcome.

In the next section, we will use these EDA findings to make a preliminary selection of the most influential variables, based on the visual trends observed in the plots.

3 Model selection

3.1 Preprocessing

Based on the Exploratory Data Analysis (EDA), we selected only the most relevant variables.

With a view to training the model, we apply one-hot encoding. We obtain the following dataset:

21 variables
Variable Type
age int
single bool
cellular bool
low_call bool
previous int
negative_emp bool
low_cpi bool
high_cci bool
low_euribor bool
university bool
p_course bool
job_student bool
job_retired bool
job_admin bool
month_sep bool
month_oct bool
month_dec bool
month_mar bool
p_failure bool
p_success bool
target bool
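The binary indicators above can be built from the raw variables with simple comparisons; a sketch in which the cut-offs taken from the EDA (CPI below 93, CCI above -40, negative employment variation) are as stated in section 2, while the remaining cut-offs are illustrative assumptions:

```r
# Construct the engineered binary features used for modeling.
# Thresholds marked "assumed" are illustrative, not taken from the report.
full_df <- data.frame(
  age          = train$age,
  single       = train$marital == "single",
  cellular     = train$contact == "cellular",
  low_call     = train$campaign <= 2,               # assumed cut-off
  previous     = train$previous,
  negative_emp = train$emp.var.rate < 0,            # from the EDA
  low_cpi      = train$cons.price.idx < 93,         # from the EDA
  high_cci     = train$cons.conf.idx > -40,         # from the EDA
  low_euribor  = train$euribor3m < 2,               # assumed cut-off
  university   = train$education == "university.degree",
  p_course     = train$education == "professional.course",
  job_student  = train$job == "student",
  job_retired  = train$job == "retired",
  job_admin    = train$job == "admin.",
  month_sep    = train$month == "sep",
  month_oct    = train$month == "oct",
  month_dec    = train$month == "dec",
  month_mar    = train$month == "mar",
  p_failure    = train$poutcome == "failure",
  p_success    = train$poutcome == "success",
  target       = train$subscribed == "yes"
)
```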

3.2 Stepwise selection

full_model <- glm(target ~ ., data = full_df, family = binomial)
stepwise <- stepAIC(full_model, direction = "both", trace = FALSE)
vif(stepwise)
##       single     cellular     low_call     previous negative_emp      low_cpi 
##     1.161049     1.257740     1.015602     4.030389     4.797421     1.673660 
##     high_cci  low_euribor   university     p_course  job_student  job_retired 
##     1.356434     5.001553     1.257091     1.123516     1.172201     1.114729 
##    job_admin    month_sep    month_oct    month_dec    month_mar    p_failure 
##     1.209488     1.109976     1.074431     1.051192     1.039553     2.885715 
##    p_success 
##     2.653624
# predictors removed by stepwise selection
stepwise$anova
## Stepwise Model Path 
## Analysis of Deviance Table
## 
## Initial Model:
## target ~ age + single + cellular + low_call + previous + negative_emp + 
##     low_cpi + high_cci + low_euribor + emp_cat + university + 
##     p_course + job_student + job_retired + job_admin + month_sep + 
##     month_oct + month_dec + month_mar + p_failure + p_success
## 
## Final Model:
## target ~ single + cellular + low_call + previous + negative_emp + 
##     low_cpi + high_cci + low_euribor + university + p_course + 
##     job_student + job_retired + job_admin + month_sep + month_oct + 
##     month_dec + month_mar + p_failure + p_success
## 
## 
##        Step Df    Deviance Resid. Df Resid. Dev      AIC
## 1                              32929   18541.01 18583.01
## 2 - emp_cat  0 0.000000000     32929   18541.01 18583.01
## 3     - age  1 0.009056746     32930   18541.02 18581.02
stepwise_formula <- target ~ single + cellular + low_call + previous + negative_emp + 
  low_cpi + high_cci + low_euribor + university + p_course + 
  job_student + job_retired + job_admin + month_sep + month_oct + 
  month_dec + month_mar + p_failure + p_success

3.3 LASSO selection

set.seed(123)

df_no_target <- subset(full_df, select = -target)

fit_lasso <- glmnet(x = as.matrix(df_no_target),
                    y = full_df$target,
                    alpha = 1,
                    family = "binomial")
## Warning in storage.mode(xd) <- "double": NAs introduced by coercion
cv_fit <- cv.glmnet(
  x = as.matrix(df_no_target),
  y = full_df$target,
  alpha = 1,
  family = "binomial"
)
## Warning in storage.mode(xd) <- "double": NAs introduced by coercion
## (warning repeated 11 times)
## Warning in cbind2(1, newx) %*% nbeta: NAs introduced by coercion
## (warning repeated 10 times)
plot(cv_fit)

# predictors selected by Lasso
coef(cv_fit, s = "lambda.1se")
## 22 x 1 sparse Matrix of class "dgCMatrix"
##                       s1
## (Intercept)  -3.40260756
## age           .         
## single        .         
## cellular      0.26429155
## low_call      .         
## previous      .         
## negative_emp  0.33148264
## low_cpi      -0.15716970
## high_cci      0.54539233
## low_euribor   1.22932991
## emp_cat       .         
## university    0.02599663
## p_course      .         
## job_student   0.27874368
## job_retired   0.29253325
## job_admin     .         
## month_sep     0.25751717
## month_oct     0.59082454
## month_dec     0.28825879
## month_mar     0.96303668
## p_failure    -0.12024151
## p_success     1.54745367
lasso_formula <- target ~ cellular + negative_emp + 
  low_cpi + high_cci + low_euribor + university + 
  job_student + job_retired + month_sep + month_oct + 
  month_dec + month_mar + p_failure + p_success

lasso_mod <- glm(lasso_formula, data = full_df, family = binomial)

3.4 Comparison

summary(stepwise)
## 
## Call:
## glm(formula = target ~ single + cellular + low_call + previous + 
##     negative_emp + low_cpi + high_cci + low_euribor + university + 
##     p_course + job_student + job_retired + job_admin + month_sep + 
##     month_oct + month_dec + month_mar + p_failure + p_success, 
##     family = binomial, data = full_df)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -4.12030    0.10037 -41.052  < 2e-16 ***
## single        0.08955    0.04463   2.007  0.04477 *  
## cellular      0.48749    0.05437   8.966  < 2e-16 ***
## low_call      0.28095    0.09007   3.119  0.00181 ** 
## previous      0.11396    0.05858   1.946  0.05171 .  
## negative_emp  0.54235    0.09230   5.876 4.20e-09 ***
## low_cpi      -0.55291    0.05180 -10.674  < 2e-16 ***
## high_cci      0.70524    0.04530  15.568  < 2e-16 ***
## low_euribor   1.40709    0.09007  15.622  < 2e-16 ***
## university    0.13424    0.04578   2.932  0.00336 ** 
## p_course      0.10243    0.06164   1.662  0.09655 .  
## job_student   0.48616    0.10267   4.735 2.19e-06 ***
## job_retired   0.52247    0.07832   6.671 2.54e-11 ***
## job_admin     0.10922    0.04752   2.298  0.02153 *  
## month_sep     0.32774    0.10954   2.992  0.00277 ** 
## month_oct     0.77146    0.09676   7.973 1.55e-15 ***
## month_dec     0.79355    0.18087   4.387 1.15e-05 ***
## month_mar     1.12505    0.10898  10.323  < 2e-16 ***
## p_failure    -0.58715    0.09471  -6.199 5.67e-10 ***
## p_success     1.26651    0.11283  11.225  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 23199  on 32949  degrees of freedom
## Residual deviance: 18541  on 32930  degrees of freedom
## AIC: 18581
## 
## Number of Fisher Scoring iterations: 6
summary(lasso_mod)
## 
## Call:
## glm(formula = lasso_formula, family = binomial, data = full_df)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -3.82821    0.05730 -66.814  < 2e-16 ***
## cellular      0.50192    0.05423   9.255  < 2e-16 ***
## negative_emp  0.56272    0.09214   6.108 1.01e-09 ***
## low_cpi      -0.57685    0.05098 -11.314  < 2e-16 ***
## high_cci      0.73139    0.04488  16.296  < 2e-16 ***
## low_euribor   1.42629    0.08966  15.907  < 2e-16 ***
## university    0.15292    0.04187   3.652  0.00026 ***
## job_student   0.50976    0.09664   5.275 1.33e-07 ***
## job_retired   0.46140    0.07643   6.037 1.57e-09 ***
## month_sep     0.33778    0.10917   3.094  0.00197 ** 
## month_oct     0.78992    0.09678   8.162 3.30e-16 ***
## month_dec     0.77731    0.18053   4.306 1.66e-05 ***
## month_mar     1.14224    0.10900  10.479  < 2e-16 ***
## p_failure    -0.44660    0.05939  -7.519 5.51e-14 ***
## p_success     1.43643    0.07412  19.379  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 23199  on 32949  degrees of freedom
## Residual deviance: 18567  on 32935  degrees of freedom
## AIC: 18597
## 
## Number of Fisher Scoring iterations: 6
# Compare the models
stepwise_results <- k_fold_mod(data = full_df, target_col = "target", model_formula = stepwise)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## (messages repeated for each of the 10 folds)
lasso_results    <- k_fold_mod(data = full_df, target_col = "target", model_formula = lasso_mod)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## (messages repeated for each of the 10 folds)
print(stepwise_results)
## $Accuracy_at_best_threshold
## [1] "0.8991 (threshold = 0.53)"
## 
## $F1_at_best_threshold
## [1] "0.4832 (threshold = 0.22)"
## 
## $Sensitivity
## [1] 0.538
## 
## $Specificity
## [1] 0.9127
## 
## $AUC
## [1] 0.7811
## 
## $AIC
## [1] 16724.85
print(lasso_results)
## $Accuracy_at_best_threshold
## [1] "0.899 (threshold = 0.5)"
## 
## $F1_at_best_threshold
## [1] "0.4855 (threshold = 0.21)"
## 
## $Sensitivity
## [1] 0.5374
## 
## $Specificity
## [1] 0.9144
## 
## $AUC
## [1] 0.7796
## 
## $AIC
## [1] 16738.96
# Threshold evaluation
probs_stepwise <- predict(stepwise, type = "response")
probs_lasso    <- predict(lasso_mod, type = "response")

res_step_05 <- evaluate_threshold(probs_stepwise, full_df$target, 0.5)
res_step_02 <- evaluate_threshold(probs_stepwise, full_df$target, 0.2)

res_lasso_05 <- evaluate_threshold(probs_lasso, full_df$target, 0.5)
res_lasso_02 <- evaluate_threshold(probs_lasso, full_df$target, 0.2)


# Combine all results into one list
results_list <- list(
  Stepwise_0.5 = res_step_05,
  Stepwise_0.2 = res_step_02,
  LASSO_0.5    = res_lasso_05,
  LASSO_0.2    = res_lasso_02
)

# Convert to a data.frame
results_df <- do.call(rbind, lapply(names(results_list), function(name) {
  res <- results_list[[name]]
  model <- sub("_.*", "", name)
  threshold <- res$Threshold
  data.frame(
    Model       = model,
    Threshold   = threshold,
    Accuracy    = res$Accuracy,
    F1          = res$F1,
    Sensitivity = res$Sensitivity,
    Specificity = res$Specificity
  )
}))

# Display the result
print(results_df)
##      Model Threshold Accuracy     F1 Sensitivity Specificity
## 1 Stepwise       0.5   0.8990 0.3182      0.2093      0.9865
## 2 Stepwise       0.2   0.8633 0.4833      0.5673      0.9009
## 3    LASSO       0.5   0.8990 0.3216      0.2126      0.9861
## 4    LASSO       0.2   0.8528 0.4676      0.5735      0.8883
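evaluate_threshold is a helper defined in the project's setup code; a minimal sketch consistent with the outputs printed in this section (a list with Threshold, Accuracy, F1, Sensitivity, and Specificity) might look like:

```r
# Sketch of the evaluate_threshold helper: classify at a fixed probability
# threshold and report confusion-matrix metrics.
evaluate_threshold <- function(probs, actual, threshold) {
  pred   <- as.integer(probs > threshold)
  actual <- as.integer(actual)
  tp <- sum(pred == 1 & actual == 1)
  tn <- sum(pred == 0 & actual == 0)
  fp <- sum(pred == 1 & actual == 0)
  fn <- sum(pred == 0 & actual == 1)
  sens <- tp / (tp + fn)            # recall on the positive class
  spec <- tn / (tn + fp)
  prec <- tp / (tp + fp)
  list(
    Threshold   = threshold,
    Accuracy    = round((tp + tn) / length(actual), 4),
    F1          = round(2 * prec * sens / (prec + sens), 4),
    Sensitivity = round(sens, 4),
    Specificity = round(spec, 4)
  )
}
```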

3.5 LDA

3.5.1 Models

## Warning in lda.default(x, grouping, ...): variables are collinear

3.5.2 CV

results_lda_full <- cv_lda_eval(full_df, target ~ ., model_name = "LDA_Full", k = 10)
## Warning in lda.default(x, grouping, ...): variables are collinear
## (warning repeated for each of the 10 folds)
results_lda_stepwise <- cv_lda_eval(full_df,stepwise_formula, model_name = "LDA_Stepwise", k = 10)

results_lda_lasso <- cv_lda_eval(full_df, lasso_formula, model_name = "LDA_LASSO", k = 10)

3.5.3 Evaluate

lda_results <- rbind(results_lda_full, results_lda_stepwise, results_lda_lasso)

aggregate(cbind(Accuracy, F1, Sensitivity, Specificity) ~ Model + Optimized_For, data = lda_results, mean)
##          Model Optimized_For Accuracy      F1 Sensitivity Specificity
## 1     LDA_Full      Accuracy  0.89993 0.30377     0.19681     0.98907
## 2    LDA_LASSO      Accuracy  0.89984 0.29692     0.18898     0.99008
## 3 LDA_Stepwise      Accuracy  0.89999 0.30326     0.19630     0.98921
## 4     LDA_Full            F1  0.86630 0.47438     0.53193     0.90867
## 5    LDA_LASSO            F1  0.86551 0.47189     0.53333     0.90780
## 6 LDA_Stepwise            F1  0.86541 0.47439     0.53515     0.90725
aggregate(Threshold ~ Model + Optimized_For, data = lda_results, mean)
##          Model Optimized_For Threshold
## 1     LDA_Full      Accuracy     0.867
## 2    LDA_LASSO      Accuracy     0.888
## 3 LDA_Stepwise      Accuracy     0.865
## 4     LDA_Full            F1     0.159
## 5    LDA_LASSO            F1     0.161
## 6 LDA_Stepwise            F1     0.156

3.5.4 Best LDA

print(lda_lasso_model)
## Call:
## lda(lasso_formula, data = full_df)
## 
## Prior probabilities of groups:
##         0         1 
## 0.8873445 0.1126555 
## 
## Group means:
##    cellular negative_emp   low_cpi  high_cci low_euribor university job_student
## 0 0.6122512    0.3732471 0.1958410 0.4123743   0.2781654  0.2894179  0.01631439
## 1 0.8332435    0.7685884 0.3922414 0.5660022   0.7206358  0.3617996  0.06115302
##   job_retired  month_sep  month_oct   month_dec   month_mar  p_failure
## 0  0.03543334 0.00844791 0.01084205 0.002667761 0.007250838 0.09990423
## 1  0.09671336 0.05711207 0.06627155 0.019935345 0.059267241 0.13011853
##    p_success
## 0 0.01347561
## 1 0.19585129
## 
## Coefficients of linear discriminants:
##                     LD1
## cellular      0.2532871
## negative_emp  0.2779495
## low_cpi      -0.7380515
## high_cci      0.4582446
## low_euribor   1.3869866
## university    0.1056124
## job_student   0.6428778
## job_retired   0.5251612
## month_sep     0.9114776
## month_oct     1.3410837
## month_dec     1.5371912
## month_mar     1.9147897
## p_failure    -0.4066345
## p_success     2.8641107
probs <- predict(lda_lasso_model)$posterior[, "1"]
pred_class <- ifelse(probs > 0.16, 1, 0)
table(Predicted = pred_class, Actual = full_df$target)
##          Actual
## Predicted     0     1
##         0 26388  1813
##         1  2850  1899
evaluate_threshold(probs, full_df$target, threshold = 0.16)
## $Threshold
## [1] 0.16
## 
## $Accuracy
## [1] 0.8585
## 
## $F1
## [1] 0.4489
## 
## $Sensitivity
## [1] 0.5116
## 
## $Specificity
## [1] 0.9025

3.5.5 ROC Curve

roc_obj <- roc(full_df$target, probs)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
plot(roc_obj, main = "ROC Curve for LDA LASSO Model")

auc(roc_obj)
## Area under the curve: 0.7778

3.6 QDA

3.6.1 Models

3.6.2 CV

#results_qda_full     <- cv_qda_eval(full_df, target ~ ., model_name = "QDA_Full", k = 10)
#results_qda_stepwise <- cv_qda_eval(full_df, stepwise_formula, model_name = "QDA_Stepwise", k = 10)
#results_qda_lasso    <- cv_qda_eval(full_df, lasso_formula, model_name = "QDA_LASSO", k = 10)

3.6.3 Evaluate

#qda_results <- rbind(results_qda_full, results_qda_stepwise, results_qda_lasso)

#aggregate(cbind(Accuracy, F1, Sensitivity, Specificity) ~ Model + Optimized_For, data = qda_results, mean)
#aggregate(Threshold ~ Model + Optimized_For, data = qda_results, mean)

3.6.4 Best QDA

#print(qda_lasso_model)

#probs_qda <- predict(qda_lasso_model)$posterior[, "1"]
#pred_class_qda <- ifelse(probs_qda > 0.4, 1, 0)
#table(Predicted = pred_class_qda, Actual = full_df$target)

#evaluate_threshold(probs_qda, full_df$target, threshold = 0.4)

3.6.5 ROC Curve

#roc_qda <- roc(full_df$target, probs_qda)
#plot(roc_qda, main = "ROC Curve for QDA LASSO Model")
#auc(roc_qda)

3.7 Random Forest

df_for_Tree <- full_df
df_for_Tree$target <- factor(ifelse(full_df$target == 1, "Yes", "No"))

train_index <- createDataPartition(df_for_Tree$target, p = 0.8, list = FALSE)

train_set <- df_for_Tree[train_index, ]
test_set  <- df_for_Tree[-train_index, ]
set.seed(123)
treeDf <- tree(target ~ ., train_set)
## Warning in tree(target ~ ., train_set): NAs introduced by coercion
summary(treeDf)
## 
## Classification tree:
## tree(formula = target ~ ., data = train_set)
## Variables actually used in tree construction:
## [1] "low_euribor" "high_cci"    "p_success"  
## Number of terminal nodes:  4 
## Residual mean deviance:  0.5752 = 15160 / 26360 
## Misclassification error rate: 0.1014 = 2674 / 26361
plot(treeDf)
text(treeDf, pretty = 0)

#cvDf <- cv.tree(treeDf)
#cvDf
#plot(cvDf$size, cvDf$dev , type = "b")
set.seed(123)
rf_train <- randomForest(target ~ ., train_set, mtry = 7, importance = TRUE)
rf_train
## 
## Call:
##  randomForest(formula = target ~ ., data = train_set, mtry = 7,      importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 7
## 
##         OOB estimate of  error rate: 10.35%
## Confusion matrix:
##        No Yes class.error
## No  22906 485  0.02073447
## Yes  2244 726  0.75555556
tree_pred <- predict(rf_train, newdata = test_set)
table(tree_pred, test_set$target)
##          
## tree_pred   No  Yes
##       No  5723  558
##       Yes  124  184
importance(rf_train)
##                       No         Yes MeanDecreaseAccuracy MeanDecreaseGini
## age           25.2956110  12.1292203           31.3416991        506.93156
## single        23.0065081  -6.4563694           19.7977154         58.06339
## cellular      19.3375782  18.0844678           30.3248184         59.53520
## low_call       8.7337941   7.5815638           12.8268986         33.92329
## previous       7.1033205   7.0518287            8.6774367        108.54280
## negative_emp  11.9843440  10.2966356           13.0175552         65.60228
## low_cpi       61.2900304 -53.8018408           45.7776987         66.29689
## high_cci      39.2888282  38.7908040           46.9895493        281.63954
## low_euribor   15.5742123  23.4782191           23.0223018        199.56946
## emp_cat       12.3823131  11.8823441           13.8186300         78.34894
## university     8.6464326   0.6008189            9.0141606         66.04441
## p_course       1.6225574   6.3389898            5.1957953         47.67366
## job_student   12.0673720  16.4276156           19.2978405         25.59045
## job_retired    0.5329116  -1.2010725           -0.1877720         25.25449
## job_admin     -1.1612488   9.0321440            5.0966691         59.93971
## month_sep      2.3396265  -1.1599812            1.5764949         29.91551
## month_oct     11.7808716  33.6707759           19.0857578         52.32859
## month_dec    -16.4558116  19.7569315            0.9232797         16.22967
## month_mar     -7.5072834  27.6167143            6.2854111         54.33978
## p_failure     12.6180629 -16.2779556           10.4101050         37.35659
## p_success      5.8042644  57.9513826           35.2105357        316.75321
varImpPlot(rf_train)

#x <- full_df[, !(names(df_for_Tree) %in% "target")]
#y <- full_df$target

#xtrain <- x[train_index, ]
#ytrain <- y[train_index]

#xtest <- x[-train_index, ]
#ytest <- y[-train_index]

#set.seed(123)
#bartfit <- lbart(xtrain, ytrain, x.test = xtest)
#yhat_bart <- bartfit$prob.test.mean
#table(yhat_bart, test_set$target)
#table(pred = yhat_bart > 0.5, actual = ytest)